SBVRLDNAComp: An Effective DNA Sequence Compression Algorithm
Authors
Abstract
Many specific types of data need to be compressed for easier storage and shorter overall retrieval times. Moreover, compressed sequences can be used to understand similarities between biological sequences. DNA data compression has become a major task for many researchers over the last few years as a result of the exponential increase in sequences produced in gene databases. In this research paper we attempt to develop an algorithm that works on a self-reference basis, namely Single Base Variable Repeat Length DNA Compression (SBVRLDNAComp). A number of reference-based compression methods exist, but they are not satisfactory for newly sequenced species. SBVRLDNAComp selects the optimal solution from the results obtained by checking small to long, uniform identical and non-identical strings of nucleotides in four different ways. Both exact repetitive and non-repetitive bases are compressed by SBVRLDNAComp. Its main strength is that, without any reference database, SBVRLDNAComp achieves a compression ratio α of 1.70 to 1.73 when tested on ten benchmark DNA sequences. The compressed file can be compressed further with standard tools (such as WinZip or WinRAR), but even without this step SBVRLDNAComp outperforms many standard DNA compression algorithms.
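The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the two general ideas it mentions, not SBVRLDNAComp. It assumes a plain run-length scheme for repeats of a single base and a fixed 2-bit code for non-repetitive bases; the names `rle_encode`, `rle_decode`, `pack_2bit`, and the `CODE` table are hypothetical and chosen for illustration.

```python
from itertools import groupby

# Assumed 2-bit code per nucleotide (illustrative; not taken from the paper).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def rle_encode(seq: str) -> list[tuple[str, int]]:
    """Collapse each run of one repeated base into a (base, run_length) pair."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (base, run_length) pairs back into the original sequence."""
    return "".join(base * count for base, count in pairs)


def pack_2bit(seq: str) -> bytes:
    """Pack bases four to a byte using the 2-bit code above.

    A final chunk shorter than four bases is left-aligned and zero-padded,
    so the caller must store the sequence length separately to decode it.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    sequence = "AAAACCGTTTTTT"
    runs = rle_encode(sequence)
    print(runs)
    print(rle_decode(runs) == sequence)
    print(pack_2bit("ACGT").hex())
```

Run-length pairs only pay off on long single-base repeats, while 2-bit packing gives a fixed saving on everything else; a practical compressor would need to choose between such representations per region.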
Similar resources
Wavelet Based Lossless DNA Sequence Compression for Faster Detection of Eukaryotic Protein Coding Regions
Discrimination of protein coding regions called exons from noncoding regions called introns or junk DNA in eukaryotic cell is a computationally intensive task. But the dimension of the DNA string is huge; hence it requires large computation time. Further the DNA sequences are inherently random and have vast redundancy, hidden regularities, long repeats and complementary palindromes and therefor...
Reference Sequence Construction for Relative Compression of Genomes
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform and also supports rapid random access to the underlying data. The main difficulty of relative compression is in selecting an appropriate reference sequence. In...
GenomeCompress: A Novel Algorithm for DNA Compression
The genome of an organism contains all hereditary information encoded in DNA. So it is extremely important to sequence the genome, which determines how organisms survive, develop and multiply. Over the past three decades, due to massive efforts on DNA sequencing, the complete genome sequences of a large number of organisms, including humans, have become known, and the genomic databases are growing exponentially...
Estimating effective DNA database size via compression
Search for sequence similarity in large-scale databases of DNA and protein sequences is one of the essential problems in bioinformatics. To distinguish random matches from biologically relevant similarities, it is customary to compute statistical P-value of each discovered match. In this context, P-value is the probability that a similarity with a given score or higher would appear by chance in...
A Better and Efficient DNA DATA COMPRESSOR by Fusion of Symbolical and ARLE Technique
Data compression is concerned with how information is organized in data. The size and importance of these databases will keep growing in the future; therefore this information must be stored or communicated efficiently. Though there are many text compression algorithms, they are not well suited to the characteristics of DNA sequences. There are algorithms for DNA compression which takes...